Abstract 

Background: Accurate determination of genetic ancestry is of high interest for many areas such as biomedical 
research, personal genomics and forensics. It remains an important topic in genetic association studies, as it has 
been shown that population stratification, if not appropriately considered, can lead to false-positive and -negative 
results. While large association studies typically extract ancestry information from available genome-wide SNP 
genotypes, many important clinical data sets on rare phenotypes and historical collections assembled before the 
GWAS era are in need of a feasible method (i.e., ease of genotyping, small number of markers) to infer the 
geographic origin and potential admixture of the study subjects. Here we report on the development, application 
and limitations of a small, multiplexable ancestry informative marker (AIM) panel of SNPs (or AISNP) developed 
specifically for this purpose. 

Results: Based on worldwide populations from the HGDP, a 41-AIM AISNP panel for multiplex application with the 
ABI SNPlex and a subset with 31 AIMs for the Sequenom iPLEX system were selected and found to be highly 
informative for inferring ancestry among the seven continental regions Africa, the Middle East, Europe, Central/ 
South Asia, East Asia, the Americas and Oceania. The panel was found to be least informative for Eurasian 
populations, and additional AIMs for a higher resolution are suggested. A large reference set including over 4,000 
subjects collected from 120 global populations was assembled to facilitate accurate ancestry determination. We 
show practical applications of this AIM panel, discuss its limitations for admixed individuals and suggest ways to 
incorporate ancestry information into genetic association studies. 

Conclusion: We demonstrated the utility of a small AISNP panel specifically developed to discern global ancestry. 
We believe that its feasibility will allow it to be used in a wide range of applications. 

Keywords: Ancestry Informative Markers, Multiplex, Global Ancestry, Population Stratification, Admixture, AISNP, 
AIMS 



Background 

Characterization of human ancestry has been of interest 
for decades as information about population structure can 
provide novel insight into the human past and remains an 
important topic in the rapidly evolving biomedical field. 
For example, because genetic variants conferring risk to a 
particular disease may be geographically restricted because 
of evolutionary forces such as mutation, genetic drift, 
migration and natural selection, the assessment of the 
genetic background in individuals chosen for a study is crucial 
in genetic epidemiology [1]. 

While still a topic of controversy [2], there is ample 
evidence that self-reported race, as for example used in 
the US Census, can predict ancestral clusters in a 
population sample. However, it does not completely inform 
on how genetic variation is apportioned within and 
between racial groups, nor does information on race reveal 
Especially in the context of mapping disease genes, 
more objective and accurate methods of defining homogeneous 
populations for the investigation of specific 
population-disease associations are required. This is not 
only paramount for specific mapping approaches such as 
admixture mapping [4], but has also been recognized as 
a crucial prerequisite for genetic association studies, as 
the presence of undetected population structure can lead 
to both false-positive results and failures to detect genu- 
ine associations [5]. Furthermore, it has been shown that 
the consequences of population structure on association 
outcomes increase markedly with sample size, and even 
modest levels of population structure within population 
groups cannot safely be ignored in the large studies 
needed to detect typical genetic effects in common dis- 
eases [6]. 

In order to assess genetic background diversity, a large 
number of ancestry informative marker (AIM) panels have 
been developed for particular applications. Genome-wide 
panels for admixture mapping have been developed 
for Hispanic populations [7], African Americans [8] or 
three-way admixture in the Americas [9], and smaller 
AIM panels have been designed to discern ancestry at 
either the global level [10-12] or within specific popu- 
lations such as the Native and Mexican Americans 
[13-15], Europeans [16-20] or African Americans [21,22]. 
In addition, genome-wide association studies (GWAS) are 
able to leverage ancestral information from the allele fre- 
quencies of the several thousand SNPs generated for 
whole-genome applications, alleviating the need for spe- 
cific AIM panels [5]. 

However, determining ancestry and controlling for 
population structure is just as important in smaller 
genetic association studies. These include, for example, 
candidate gene studies involving only a few genetic 
markers, replication of GWAS findings, and studies of 
smaller, highly valuable collections of rare pathological 
phenotypes and historical collections with limited 
amounts of DNA. Genotyping these samples on 
large AIM panels or leveraging ancestry information 
from preexisting genotyping is often not practical or 
possible. 

To address this specific need, we set out to develop a 
highly informative AIM panel that would allow us to 
infer a subject's ancestral origin at the continental level 
and estimate admixture proportions among at least 
seven main geographic regions: Africa, the Middle East, 
Europe, Central and South Asia, East Asia, Oceania and 
the Americas. The selection of such AIMs has to focus 
on SNPs with the largest allele frequency differences be- 
tween the continental regions of interest to achieve the 
desired resolution at the continental level. Such high 
resolution is required because genetic diversity of human 
populations follows gradients or geographic clines within 
and among continents rather than specific clusters or 
clades [3,23,24]. 

We further aimed for the development of a feasible 
method to determine ancestry, as resources such as 
funding and available DNA are often limited for these 
applications. We therefore developed panels of AISNPs 
suitable for multiplex application on two commonly 
used platforms, the ABI SNPlex [25] and Sequenom 
iPLEX [26] systems. Additionally, all markers are also in- 
cluded on the Illumina HumanHap550 array, thus 
allowing for a combined analysis with studies genotyped 
on the Illumina whole-genome arrays. 

Lastly, we specifically focused on the applicability of 
our panel to determine the ancestry of subjects from any 
of the worldwide geographic origins. To date, most re- 
search involving genetic association studies has focused 
on populations of European descent, where longer LD 
blocks require fewer genetic markers to be genotyped 
[27]. However, current gene-mapping efforts specifically 
request more global research, thus increasing the need 
for global AIM panels. Furthermore, global ancestry 
determination is especially important in clinical samples 
ascertained in specific geographic regions such as 
Southern California that are inhabited by individuals 
with very diverse and often heavily admixed ancestries. 

Here we describe the development of AIM panels 
based on the well-studied global reference populations 
from the HGDP-CEPH [28], which include 52 geograph- 
ically diverse populations collected from seven continen- 
tal regions. We then greatly expanded the reference 
population set by genotyping the AIMs in over 2,000 
additional subjects of known ancestry with the goal of 
achieving the most comprehensive global reference col- 
lection possible. We report on these efforts and describe 
highly discriminative ancestry informative 41- and 31- 
marker panels for multiplex applications. 

Methods 

Reference populations 

AIM panels were developed based on the global refer- 
ence populations from the HGDP-CEPH [28]. A total of 
941 subjects including 52 populations from the stan- 
dardized H952 subset were selected [29]. Based on the 
geographic origin of the samples, HGDP subjects were 
assigned to one of seven geographic or continental 
regions: Africa (n = 131), the Middle East (including the 
North African Mozabites, n = 133), Europe (n = 158), 
Central/South Asia (CS Asia, n = 198), East Asia (E Asia, 
n = 229), the Americas (n = 64) and Oceania (n = 28) 
(Additional file 1: Table S1). 

AIM panel development 

Genotypes of HGDP subjects from the Illumina 650Y SNP 
array are publicly available (http://hagsc.org/hgdp/files.html). 
We used Infocalc 1.1 [30] to calculate the marker 
informativeness (I_n) among the seven continental 
regions for each of the 644,195 autosomal markers. The 
mean informativeness of all markers was 0.0539, with a 
wide range of I_n = 0.0003-0.406. AIMs were selected 
according to the following criteria: being autosomal, 
unambiguous (AC, AG, TC, TG) and present on the 
Illumina Hap550 array (n = 547,458). Next, the top 
5,000 markers with the highest I_n were chosen (I_n > 
0.077) and, to reduce the correlation of markers, were 
subjected to LD pruning using PLINK [31] at a variance 
inflation factor (VIF) threshold of 1.5. The resulting pool 
of AIMs included 1,442 SNPs 
(Additional file 2: Table S2). 
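
The ranking statistic can be sketched as follows. This is a minimal illustration that assumes I_n is the informativeness for assignment defined by Rosenberg and colleagues, the statistic that Infocalc reports; the function name and the toy frequencies are hypothetical, and the code is not the Infocalc implementation.

```python
import math

def informativeness(freqs_by_region):
    """Informativeness for assignment (I_n) of one marker.

    freqs_by_region: one list per region, each holding the allele
    frequencies of the marker in that region (summing to 1).
    Assumed formula: sum over alleles of
    -p_mean*ln(p_mean) + (1/K) * sum over regions of p*ln(p).
    """
    k = len(freqs_by_region)
    n_alleles = len(freqs_by_region[0])
    total = 0.0
    for j in range(n_alleles):
        # Mean frequency of allele j across the K regions.
        p_mean = sum(region[j] for region in freqs_by_region) / k
        if p_mean > 0:
            total -= p_mean * math.log(p_mean)
        for region in freqs_by_region:
            p = region[j]
            if p > 0:  # 0 * ln(0) is taken as 0
                total += p * math.log(p) / k
    return total

# Hypothetical biallelic markers in three regions.
toy = {
    "marker_a": [[0.95, 0.05], [0.10, 0.90], [0.50, 0.50]],
    "marker_b": [[0.50, 0.50], [0.55, 0.45], [0.45, 0.55]],
}
# Rank markers from most to least informative.
ranked = sorted(toy, key=lambda m: informativeness(toy[m]), reverse=True)
print(ranked)
```

A marker with identical frequencies in all regions scores 0, and a biallelic marker fixed for different alleles in two regions scores ln 2, which is why markers with large frequency contrasts rise to the top of the ranking.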

A small panel for multiplexing applications was developed 
by first choosing from the pool of 1,442 AIMs the 
top ten markers with the highest allele frequency 
differences (δ) between each of the 21 pairwise continental 
region comparisons. This set of 210 markers was then 
further reduced in an iterative way by considering multiplex 
genotyping requirements for the ABI SNPlex genotyping 
system [25] and Sequenom iPLEX system [26], 
leading to the final 41-AIM set for ABI SNPlex genotyping 
and the matching 31-AIM set for Sequenom iPLEX 
genotyping. 
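
The first selection step (top markers per pairwise comparison) can be illustrated with a short sketch. It assumes δ is the absolute difference in allele frequency between two regions; the function name, the data layout and the toy frequencies are hypothetical. With seven regions there are 7 × 6 / 2 = 21 pairwise comparisons, so ten markers per comparison gives the 210 marker slots described above.

```python
from itertools import combinations

def top_markers_per_pair(freqs, regions, n_top=10):
    """Pick the n_top markers with the largest delta for each region pair.

    freqs: dict mapping marker -> dict mapping region -> allele frequency.
    Returns a dict mapping (region_a, region_b) -> list of marker names.
    """
    selected = {}
    for a, b in combinations(regions, 2):
        # Sort markers by absolute allele frequency difference, largest first.
        ranked = sorted(
            freqs,
            key=lambda m: abs(freqs[m][a] - freqs[m][b]),
            reverse=True,
        )
        selected[(a, b)] = ranked[:n_top]
    return selected

# Hypothetical frequencies for three markers in three regions.
toy_freqs = {
    "m1": {"Africa": 0.9, "Europe": 0.1, "East Asia": 0.1},
    "m2": {"Africa": 0.2, "Europe": 0.2, "East Asia": 0.9},
    "m3": {"Africa": 0.5, "Europe": 0.4, "East Asia": 0.5},
}
toy_regions = ["Africa", "Europe", "East Asia"]
selected = top_markers_per_pair(toy_freqs, toy_regions, n_top=1)
# The union of all per-pair selections forms the candidate set.
candidates = sorted({m for markers in selected.values() for m in markers})
print(selected)
print(candidates)
```

Selecting per pair rather than by overall I_n guarantees that every continental comparison, including the harder ones within Eurasia, is represented among the candidates before platform constraints are applied.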

Additional reference and test populations 

To validate the AIM panels and increase the global 
coverage of the reference population set for down- 
stream applications, we included two additional, very 
large data sets with worldwide populations: the Inter- 
national HapMap Project (http://hapmap.ncbi.nlm.nih. 
gov/; phase III release 2 and 3) standard set HAP1161 
[32] included 931 subjects from 11 populations, and 
the Yale data set included 2,146 subjects from 57 populations 
[33]. The combined reference set included 
4,018 unrelated subjects from 120 (partially overlapping) 
populations (Additional file 1: Table S1). These 
reference populations have been described previously 
[33], and geographic features such as latitude and 
longitude of these populations are presented in the 
allele frequency database ALFRED (http://alfred.med. 
yale.edu/) [34]. Genotypes of at least 40 of the 41 
AIMs were available for all reference subjects. 

Finally, to illustrate a practical application of the 41- 
AIM panel with our complete set of global reference 
populations, a contemporary population sample of 2,392 
subjects ascertained in Southern California [35] was ge- 
notyped using the ABI SNPlex system. Ancestry was de- 
termined for all subjects with < 5% genotypes missing. 

Statistical analyses 

Population structure and individual ancestry estimates 
were obtained using STRUCTURE v2.3.2.1 [36,37]. To 
assess the global informativeness of the 41-AIM panel 
in the original HGDP reference populations, five independent 
runs without prior population assignment 
were performed at K = 2 to K = 7, using 20,000 burn-in 
cycles and 20,000 MCMC replications under the 
admixture model. The "infer α" option with the same, 
uniform alpha for all populations was used under 
the λ = 1 option. All other parameters were set at 
default. 

To further validate the 41-AIM panel, ancestry estimates 
of 3,077 independent subjects of known ancestry 
from 68 global populations (reference set 2) were determined 
at K = 7 using the above STRUCTURE parameters, 
but now including prior population information of 
the HGDP reference set. Allele frequencies were updated 
using only individuals with population information at a 
migration prior of 0.05. Graphs were plotted using 
DISTRUCT v1.1 [38]. 

CLUMPP v1.1.2 [39] was used to evaluate different 
replicates of STRUCTURE runs. To assign a subject to a 
specific cluster, we applied cutoffs of >85% and >50% 
cluster membership, respectively. These criteria were se- 
lected to facilitate a comparison with Seldin's 93-AIM 
panel [10]. Finally, to validate the AIM panels, the per- 
centage of subjects that clustered correctly compared to 
the known geographic origin was calculated. 
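
As a sketch of this scoring step, the function below counts a subject as correctly clustered when its largest membership coefficient exceeds the cutoff and belongs to the region of known origin. The data layout and names are hypothetical, and this is one plausible reading of the criterion rather than the authors' script.

```python
def fraction_correct(memberships, known_regions, cutoff):
    """Fraction of subjects assigned to their known region at a cutoff.

    memberships: list of dicts mapping region -> cluster membership
                 coefficient (one dict per subject, values sum to 1).
    known_regions: list of the known region of origin for each subject.
    cutoff: minimum membership (e.g. 0.85 or 0.50) required for assignment.
    """
    correct = 0
    for q, truth in zip(memberships, known_regions):
        best = max(q, key=q.get)  # cluster with the largest membership
        if q[best] > cutoff and best == truth:
            correct += 1
    return correct / len(known_regions)

# Three hypothetical subjects of known African origin.
toy_q = [
    {"Africa": 0.90, "Europe": 0.10},
    {"Africa": 0.60, "Europe": 0.40},
    {"Africa": 0.30, "Europe": 0.70},
]
toy_truth = ["Africa", "Africa", "Africa"]
strict = fraction_correct(toy_q, toy_truth, 0.85)
lenient = fraction_correct(toy_q, toy_truth, 0.50)
print(strict, lenient)
```

In this toy case one of three subjects passes the 85% criterion and two of three pass the 50% criterion, which mirrors how admixed populations lose subjects under the stricter cutoff.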

Population structure was further analyzed using prin- 
cipal component analysis (PCA) implemented in the 
EIGENSTRAT software [40] and multi-dimensional scal- 
ing (MDS) as implemented in PLINK. All other calcula- 
tions were performed in R v2.15.0. 

As a measure of informativeness of the different AIM 
panels at the population level, we calculated Fst, a genetic 
distance measure for inter-population differentiation 
compared to intra-population variation. Significance of 
pairwise Fst values was established using 10,000 permutations. 
A Mantel test was used to correlate the Fst matrices 
based on the 41-AIM and 31-AIM panels. Calculations 
were performed in ARLEQUIN 3.5 [41]. 
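
The idea behind the Mantel test can be sketched in a few lines. This is a generic permutation version written for illustration, not ARLEQUIN's implementation; the function names are hypothetical, and the small 4 × 4 matrices reuse the Africa, Middle East, Europe and CS Asia entries of Table 2 only as example input (the analysis reported here was run on population-level matrices).

```python
import math
import random

def upper_triangle(matrix):
    """Return the above-diagonal entries of a square matrix as a flat list."""
    n = len(matrix)
    return [matrix[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def mantel(dist_a, dist_b, n_perm=999, seed=1):
    """One-sided Mantel test: correlation of two symmetric distance matrices.

    Rows and columns of dist_b are permuted together to build the null
    distribution of the correlation coefficient.
    """
    rng = random.Random(seed)
    n = len(dist_a)
    flat_a = upper_triangle(dist_a)
    observed = pearson(flat_a, upper_triangle(dist_b))
    order = list(range(n))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(order)
        permuted = [[dist_b[order[i]][order[j]] for j in range(n)]
                    for i in range(n)]
        if pearson(flat_a, upper_triangle(permuted)) >= observed:
            hits += 1
    # Add one to numerator and denominator to count the observed arrangement.
    return observed, (hits + 1) / (n_perm + 1)

# Example input: Fst among Africa, Middle East, Europe and CS Asia.
fst_41 = [[0.0, 0.457, 0.498, 0.417],
          [0.457, 0.0, 0.054, 0.080],
          [0.498, 0.054, 0.0, 0.086],
          [0.417, 0.080, 0.086, 0.0]]
fst_31 = [[0.0, 0.439, 0.456, 0.401],
          [0.439, 0.0, 0.043, 0.076],
          [0.456, 0.043, 0.0, 0.087],
          [0.401, 0.076, 0.087, 0.0]]
r, p = mantel(fst_41, fst_31)
print(round(r, 3), p)
```

Permuting whole rows and columns, rather than individual cells, keeps the dependence among distances that share a population, which is the reason a Mantel test is used instead of an ordinary correlation test.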

To investigate the informativeness of the AIM pa- 
nels in detecting admixture at the individual level, 
subjects from two admixed populations of the Southern 
California test population (self-reported African Americans 
and self-reported Hispanic White and Native Americans) 
were selected. These subjects were genotyped on the 
Illumina HumanOmniExpressExome array, and individual 
ancestry estimates were determined with a second, 
independent approach (see [42] for details). In 
brief, we used over 10,000 GWAS-derived SNPs, a 
set of 2,513 (partly overlapping) reference individuals and 
a two-step analysis approach implemented in ADMIX- 
TURE [43]. Individual admixture estimates based on the 
GWAS-derived panel were then compared to the admix- 
ture estimates based on the 41- and 31-AIM panels for 
these two admixed populations (see above). 
Results 

Characterization of small AIM panels to determine 
continental ancestry 

Fifty-two global populations from the HGDP-CEPH 
panel [28] were used to select AIMs optimized for the 
determination of continental ancestry. We developed a 
small 41-AIM panel specifically for multiplex application 
on the ABI system from a pre-selected pool of 1,442 
highly informative AIMs (Additional file 1: Table S1). 
The panel was further reduced to 31 AIMs for application 
on the Sequenom iPLEX system. 

Table 1 shows the informativeness (I_n) and pairwise 
allele frequency differences (δ) among the seven continental 
regions for each of the 41 AIMs. I_n ranges from 
0.08 to 0.41 with a high mean of 0.23. The largest I_n and 
largest δ for each of the 21 continental comparisons are 
indicated in bold, highlighting the strength of a marker 
to distinguish between specific global origins. 
Most continental comparisons included several markers 
with very high δ of >0.8. The smallest allele frequency 
differences were found for comparisons of regions 
within Eurasia, where the top markers showed δ in the 
range of 0.4, indicating limited power to accurately 
distinguish subjects from Europe, the Middle East and 
Central/South Asia from each other. 

The AIM panels were further characterized by calculating 
Fst [41] as a measure of the panels' relative 
strength to distinguish the seven geographic regions. 
Table 2 shows the genetic distance between the continental 
regions when using the 41-AIM (lower diagonal) 
and 31-AIM panel (upper diagonal), respectively. Inter-continent 
differentiation was based on allele frequencies 
from 51 HGDP populations; the atypical North African 
Mozabites were excluded here. 

In general, we found high Fst values distinguishing the 
African, East Asian, American and Oceanian regions. As 
expected, the lower Fst values among Europe, the Middle 
East and Central/South Asia reflect the I_n and δ 
found for the single markers. When comparing the Fst 
values of the full 41-AIM panel with the reduced 
31-AIM panel, no significant differences were found 
(Wilcoxon signed rank test, n = 21 paired comparisons, 
p > 0.38). In addition, a comparison of all pairwise Fst 
values among the 52 populations showed a highly significant 
correlation among the Fst values calculated 
based on the 41-AIM panel and the 31-AIM panel 
(Mantel test, r = 0.987, p < 0.001), further indicating no 
significant loss of power to discern global ancestry in the 
smaller panel. 

Lastly, the population structure of the HGDP was ana- 
lyzed using STRUCTURE. To facilitate a comparison 
with previous studies (e.g., [10,12,24,33,44]), we used 
similar model parameters without prior information 
about individual sampling locations. Figure 1 shows the 
most typical patterns with the highest likelihood from 
each of 20 independent runs at K = 2-7. Similar to 
Rosenberg's analyses including 377 microsatellites [44] 
and 993 SNPs [24], we found stable results with two 
clusters anchored by Africa and the Americas at K = 2 
(20/20 runs) and a separation of Africa at K = 3 (19/20). 
At K = 4, a new cluster emerged isolating either the 
Americas (11/20) or alternatively Central/South Asia (9/ 
20), and at K = 5 both of these regions were isolated 
(14/20). Most runs separated Europe from the Middle 
East at K = 6 (17/20), and at K = 7 the main continental 
regions for whose partitioning the panel was designed 
were separated from each other in the majority of runs 
(11/20) and with the highest likelihood. 

Validation of the 41-AIM panel using additional 
populations of known origin 

We further tested the performance of the 41-AIM panel 
in a realistic setting and estimated the ancestry of 3,077 
test subjects from 68 regionally collected populations 
from the HapMap III and Yale collections. These test 
populations have been extensively characterized by us 
and others (see, e.g., [33] and [45]) and are well suited 
for this purpose. STRUCTURE was run with the HGDP 
as predefined reference populations at K = 7 (Yale samples 
were not genotyped for rs2717329). Table 3 shows 
the average cluster membership of individuals belonging 
to a specific population for each of the seven continental 
regions, Africa, the Middle East, Europe, Central/South 
Asia, East Asia, the Americas and Oceania (n = 68 
populations). We calculated the percentage of subjects that 
clustered correctly, using criteria of >85% and >50% 
cluster membership (MS), respectively. 

We found that African populations had very high clus- 
ter membership in the African cluster, but East African 
populations (e.g., Chagga, Maasai and Sandawe) showed 
slightly lower values. As expected, admixed African 
Americans as well as a population of Ethiopian Jews 
showed some cluster membership in Europe and the 
Middle East, and less than 50% of the subjects were included 
in the African group at the 85% MS criterion. 

The ethnoreligious Samaritans, Yemenite Jews and 
Druze clustered with the Kuwaiti predominantly in the 
Middle East, but also showed a significant European 
contribution. As expected, most European populations 
clustered predominantly with Europe. However, there 
was a significant Middle Eastern component, even for 
the Northern European populations such as the Finns 
and Irish, demonstrating the somewhat reduced specifi- 
city of the 41-AIM panel to distinguish between Europe 
and the Middle East compared to the resolution between 
other continents. When applying the less stringent 50% 
MS criterion, most populations had over 90% of their 
subjects placed in Europe. Not surprisingly, the Russian 



Table 1 Informativeness (I_n) and allele frequency differences (δ) between seven continental regions for SNPs on the 41-AIM panel 

Continental regions: 1 = Africa, 2 = the Americas, 3 = Central/South Asia, 4 = East Asia, 5 = Europe, 6 = the Middle East, 7 = Oceania 

SNP          I_n    1-2    1-3    1-4    1-5    1-6    1-7    2-3    2-4    2-5    2-6    2-7    3-4    3-5    3-6    3-7    4-5    4-6    4-7    5-6    5-7    6-7
rs1834640    0.406  0.367  0.838  0.090  0.945  0.918  0.004  0.471  0.277  0.578  0.551  0.371  0.749  0.107  0.080  0.842  0.855  0.828  0.093  0.027  0.949  0.921
rs9809818*   0.364  0.811  0.355  0.822  0.057  0.079  0.972  0.455  0.012  0.754  0.731  0.162  0.467  0.299  0.276  0.617  0.765  0.743  0.150  0.023  0.916  0.893
rs310644     0.352  0.909  0.741  0.925  0.899  0.812  0.063  0.168  0.016  0.010  0.097  0.846  0.184  0.158  0.071  0.678  0.026  0.114  0.862  0.087  0.836  0.749
rs1834619*   0.334  0.953  0.280  0.723  0.073  0.028  0.714  0.673  0.230  0.880  0.926  0.239  0.443  0.207  0.252  0.434  0.650  0.695  0.008  0.045  0.642  0.687
rs1572018*   0.320  0.930  0.718  0.340  0.877  0.724  0.036  0.212  0.590  0.053  0.206  0.894  0.378  0.159  0.006  0.682  0.537  0.384  0.304  0.153  0.841  0.688
rs7226659*   0.318  0.415  0.055  0.575  0.002  0.019  0.950  0.360  0.160  0.417  0.396  0.535  0.520  0.057  0.036  0.894  0.577  0.556  0.375  0.021  0.952  0.931
rs260714*    0.302  0.064  0.589  0.107  0.705  0.601  0.164  0.653  0.043  0.769  0.665  0.099  0.696  0.116  0.012  0.752  0.812  0.708  0.056  0.104  0.868  0.764
rs4918664    0.287  0.855  0.285  0.869  0.147  0.038  0.174  0.569  0.014  0.708  0.816  0.681  0.583  0.138  0.247  0.111  0.722  0.830  0.695  0.109  0.027  0.136
rs4471745    0.284  0.097  0.003  0.023  0.050  0.104  0.869  0.094  0.120  0.048  0.006  0.967  0.025  0.047  0.101  0.872  0.072  0.126  0.847  0.054  0.919  0.973
rs11725412   0.272  0.704  0.047  0.432  0.186  0.243  0.421  0.751  0.272  0.890  0.947  0.284  0.479  0.138  0.196  0.468  0.617  0.675  0.011  0.058  0.606  0.664
rs3098610    0.271  0.761  0.328  0.849  0.265  0.399  0.980  0.433  0.088  0.496  0.362  0.219  0.521  0.063  0.071  0.652  0.584  0.450  0.131  0.134  0.715  0.581
rs4664511    0.268  0.568  0.143  0.502  0.043  0.090  0.833  0.424  0.066  0.611  0.658  0.266  0.358  0.187  0.233  0.690  0.545  0.591  0.332  0.047  0.877  0.923
rs10079352   0.267  0.975  0.540  0.957  0.459  0.312  0.419  0.434  0.017  0.516  0.662  0.556  0.417  0.082  0.228  0.121  0.499  0.645  0.539  0.146  0.040  0.107
rs2166624    0.262  0.977  0.323  0.430  0.383  0.224  0.000  0.654  0.546  0.594  0.753  0.977  0.108  0.060  0.099  0.323  0.047  0.206  0.430  0.159  0.383  0.224
rs12498138   0.247  0.906  0.070  0.090  0.060  0.040  0.089  0.836  0.817  0.846  0.866  0.817  0.020  0.010  0.030  0.019  0.029  0.050  0.000  0.020  0.029  0.049
rs7251928    0.245  0.966  0.778  0.942  0.741  0.662  0.787  0.188  0.024  0.225  0.304  0.179  0.164  0.037  0.116  0.009  0.201  0.280  0.155  0.079  0.046  0.125
rs6990312    0.244  0.457  0.516  0.787  0.621  0.505  0.166  0.060  0.330  0.164  0.048  0.623  0.271  0.104  0.012  0.683  0.166  0.282  0.953  0.116  0.787  0.671
rs3823159    0.242  0.477  0.859  0.745  0.904  0.815  0.371  0.382  0.268  0.427  0.338  0.106  0.114  0.044  0.045  0.488  0.158  0.070  0.374  0.089  0.533  0.444
rs2024566    0.236  0.903  0.257  0.012  0.218  0.239  0.056  0.647  0.891  0.685  0.664  0.959  0.245  0.039  0.017  0.312  0.206  0.228  0.067  0.022  0.273  0.295
rs7722456    0.235  0.362  0.268  0.351  0.175  0.178  0.569  0.094  0.011  0.187  0.184  0.931  0.084  0.093  0.089  0.836  0.176  0.173  0.920  0.003  0.744  0.747
rs9880567    0.232  0.567  0.326  0.441  0.307  0.353  0.402  0.241  0.126  0.260  0.214  0.969  0.115  0.019  0.027  0.728  0.134  0.088  0.843  0.046  0.709  0.755
rs842639     0.231  0.706  0.340  0.603  0.014  0.062  0.674  0.366  0.103  0.692  0.768  0.032  0.263  0.326  0.402  0.334  0.589  0.665  0.071  0.076  0.660  0.735
rs10497191   0.230  0.883  0.858  0.893  0.822  0.768  0.889  0.024  0.010  0.061  0.115  0.007  0.035  0.037  0.091  0.031  0.072  0.126  0.004  0.054  0.068  0.122
rs734241     0.227  0.899  0.211  0.311  0.003  0.035  0.464  0.688  0.589  0.896  0.864  0.435  0.100  0.208  0.176  0.253  0.308  0.276  0.153  0.032  0.461  0.429
rs735480*    0.224  0.459  0.755  0.597  0.895  0.815  0.205  0.296  0.138  0.436  0.356  0.255  0.158  0.140  0.060  0.551  0.298  0.218  0.393  0.080  0.691  0.611
rs2593595    0.218  0.615  0.688  0.805  0.753  0.683  0.990  0.073  0.189  0.138  0.068  0.375  0.117  0.065  0.004  0.303  0.052  0.121  0.186  0.069  0.237  0.307
rs2717329    0.213  0.762  0.603  0.710  0.759  0.520  0.980  0.159  0.052  0.003  0.241  0.219  0.107  0.156  0.083  0.378  0.049  0.189  0.271  0.239  0.222  0.460
rs10961366*  0.213  0.019  0.254  0.023  0.254  0.278  0.623  0.235  0.042  0.235  0.259  0.642  0.277  0.000  0.024  0.876  0.277  0.301  0.600  0.024  0.877  0.900
rs1557553    0.207  0.887  0.080  0.273  0.085  0.045  0.195  0.806  0.614  0.802  0.842  0.692  0.193  0.004  0.036  0.114  0.188  0.228  0.078  0.040  0.110  0.150
rs4741658*   0.202  0.742  0.458  0.774  0.225  0.117  0.630  0.285  0.032  0.518  0.626  0.113  0.317  0.233  0.341  0.172  0.549  0.658  0.144  0.108  0.405  0.513
rs2196051    0.202  0.012  0.228  0.002  0.645  0.493  0.020  0.216  0.014  0.633  0.481  0.031  0.230  0.417  0.265  0.248  0.647  0.495  0.018  0.152  0.665  0.512
rs6737672    0.199  0.977  0.585  0.620  0.516  0.561  0.321  0.392  0.357  0.461  0.415  0.655  0.035  0.069  0.024  0.264  0.104  0.059  0.299  0.046  0.194  0.240
rs10877030   0.185  0.472  0.082  0.231  0.197  0.318  0.460  0.390  0.241  0.669  0.790  0.012  0.149  0.279  0.400  0.378  0.428  0.549  0.229  0.122  0.657  0.779
rs7837234    0.162  0.113  0.003  0.271  0.039  0.044  0.619  0.110  0.158  0.152  0.069  0.732  0.268  0.042  0.041  0.622  0.310  0.227  0.890  0.083  0.580  0.663
rs4907251    0.157  0.843  0.550  0.528  0.205  0.242  0.475  0.292  0.315  0.638  0.600  0.367  0.023  0.346  0.308  0.075  0.323  0.286  0.052  0.038  0.271  0.233
rs1863086    0.149  0.658  0.133  0.214  0.255  0.116  0.310  0.525  0.872  0.403  0.542  0.348  0.347  0.122  0.017  0.177  0.469  0.330  0.524  0.139  0.055  0.194
rs310362*    0.147  0.763  0.476  0.533  0.627  0.279  0.751  0.287  0.229  0.136  0.484  0.011  0.058  0.151  0.197  0.276  0.094  0.255  0.218  0.349  0.124  0.473
rs4705360    0.139  0.701  0.300  0.397  0.047  0.011  0.419  0.401  0.304  0.654  0.712  0.282  0.097  0.253  0.312  0.119  0.350  0.409  0.021  0.058  0.372  0.430
rs4833103    0.110  0.016  0.038  0.009  0.449  0.104  0.000  0.022  0.007  0.434  0.089  0.016  0.029  0.412  0.067  0.038  0.441  0.096  0.009  0.345  0.449  0.104
rs359955*    0.090  0.455  0.533  0.613  0.639  0.298  0.410  0.078  0.158  0.183  0.157  0.046  0.080  0.106  0.235  0.124  0.026  0.315  0.204  0.341  0.229  0.111
rs12878166   0.078  0.137  0.220  0.413  0.098  0.259  0.038  0.356  0.549  0.039  0.396  0.174  0.193  0.317  0.039  0.182  0.510  0.154  0.375  0.356  0.135  0.221

*Not on the 31-AIM panel. 
Table 2 Pairwise Fst values among the seven continental regions calculated based on allele frequencies of 41 AIMs 
(below diagonal) and 31 AIMs (above diagonal) 

             Africa  Middle East  Europe  CS Asia  East Asia  Americas  Oceania
Africa       -       0.439        0.456   0.401    0.581      0.723     0.594
Middle East  0.457   -            0.043   0.076    0.365      0.535     0.501
Europe       0.498   0.054        -       0.087    0.358      0.496     0.479
CS Asia      0.417   0.080        0.086   -        0.206      0.369     0.372
East Asia    0.564   0.395        0.391   0.232    -          0.373     0.466
Americas     0.712   0.531        0.501   0.353    0.305      -         0.617
Oceania      0.632   0.552        0.543   0.411    0.398      0.555     -

All Fst values are significant at p < 0.0001. 

populations Adygei, Chuvash, Komi Zyriane and Russian 
Vologda were found to have a significant Central/South 
Asian component. 

The Central/South Asian cluster included the majority 
of the Gujarati, Keralite and Thoti Indians at the 50% 
MS criterion. As expected, the Kachari Assam, located 
in the East, also showed a significant East Asian 
contribution. However, there was no predominant placing in 
any of the seven continental groupings for the Khanty, a 
population from western Siberia. This is expected since 
the current continental grouping at K = 7 does not 
include a specific Siberian/North Asian cluster. The 
Khanty are currently our only representatives of this 
large geographic area. 

The East Asian test subjects from 15 diverse populations 
clustered in East Asia with almost no exception. 
Most Southern Malaysians also showed some Central/South 
Asian contribution. Most Native American populations 
clustered predominantly in the Americas. 
Exceptions were the admixed Muscogee and HapMap 
Mexicans, which were not placed in this cluster, but 
showed a strong European component. The Oceanic 
cluster included all Papua-New Guinean and Nasioi 
Melanesian subjects. However, the Micronesian and 
Samoan subjects from this broad geographic area were 
not assigned to Oceania at the 50% MS criterion, but 
were found to be admixed with a strong East Asian 
component. 

Finally, we combined the HapMap III and Yale col- 
lections with the HGDP, and further analyses were 
conducted with our complete reference population set 
including 4,018 subjects genotyped on the 41-AIM 
panel. A principal component analysis (PCA) including 
all 4,018 subjects and averaged for each of the 120 popu- 
lations is shown in Figure 2. We found that the first PC 
explained 27.6% of the genetic variability in the data set 
and corresponded with the Africa to Americas gradient 
found by STRUCTURE at K = 2. PC2 explained an 




Table 3 Continental ancestry based on STRUCTURE analysis of 68 test populations genotyped on the 41-AIM panel with HGDP subjects included as reference 
populations 

Population  Africa  Middle East  Europe  CS Asia  E Asia  Americas  Oceania  85% MS*  50% MS*  | Population  Africa  Middle East  Europe  CS Asia  E Asia  Americas  Oceania  85% MS*  50% MS*
Africa (n = 823)                                                                               | CS Asia (n = 142)
MBU         0.988   0.002        0.002   0.002    0.002   0.001     0.003    1                 | GIH         0.010   0.056        0.032   0.811    0.042   0.026     0.024    0.61     0.93
BIA         0.988   0.003        0.002   0.002    0.002   0.001     0.002    1                 | KER         0.010   0.064        0.044   0.805    0.044   0.021     0.013    0.50     0.93
IBO         0.975   0.005        0.004   0.004    0.004   0.003     0.005    1                 | THT         0.040   0.035        0.026   0.696    0.065   0.061     0.078    0.21     0.79
YOR         0.974   0.005        0.005   0.005    0.004   0.003     0.004    1                 | KCH         0.010   0.053        0.040   0.590    0.209   0.065     0.034    0.19     0.63
YRI         0.969   0.006        0.005   0.007    0.005   0.004     0.005    0.99              | Siberia (n = 47)
ZRM         0.957   0.010        0.008   0.010    0.006   0.004     0.006    0.95              | KTY         0.007   0.076        0.224   0.198    0.330   0.145     0.020    0        0.06
LSG         0.947   0.008        0.009   0.008    0.007   0.007     0.015    1                 | East Asia (n = 680)
LWK         0.941   0.017        0.012   0.012    0.007   0.005     0.006    0.95              | ATL         0.003   0.004        0.005   0.007    0.965   0.011     0.005    1
HAS         0.930   0.020        0.012   0.015    0.010   0.007     0.008    0.86              | CHB         0.004   0.005        0.005   0.007    0.963   0.010     0.007    1
CGA         0.889   0.036        0.025   0.026    0.010   0.006     0.008    0.71              | CHS         0.006   0.006        0.005   0.009    0.960   0.008     0.005    1
MAS         0.844   0.060        0.039   0.029    0.011   0.010     0.006    0.45              | JPN         0.005   0.005        0.004   0.006    0.967   0.008     0.005    1
SND         0.841   0.050        0.030   0.038    0.023   0.009     0.010    0.47              | CHT         0.006   0.004        0.004   0.007    0.961   0.012     0.007    0.98

AAM 


0.781 


0.069 


0.070 


0.039 


0.01 7 


0.012 


0.012 


0.44 


0.91 


KOR 


0.004 


0.005 


0.004 


0.007 


0.964 


0.010 


0.006 


0.98 




ASW 


0.761 


0.057 


0.092 


0.042 


0.01 7 


0.023 


0.008 


0.24 


1 


AMI 


0.005 


0.005 


0.004 


0.008 


0.966 


0.005 


0.007 


0.97 




MKK 


0.748 


0.120 


0.051 


0.055 


0.01 1 


0.008 


0.007 


0.22 


0.99 


CHD 


0.005 


0.006 


0.006 


0.008 


0.957 


0.010 


0.008 


0.97 




ETH 


0.430 


0.402 


0.096 


0.055 


0.006 


0.004 


0.007 


0 


0.39 


HKA 


0.004 


0.005 


0.005 


0.008 


0.956 


0.013 


0.008 


0.97 




Middle East (n = 165) 
















JPT 


0.005 


0.006 


0.005 


0.008 


0.957 


0.016 


0.005 


0.97 




YMJ 


0.012 


0.745 


0.166 


0.061 


0.007 


0.004 


0.004 


0.47 


0.85 


LAO 


0.012 


0.01 1 


0.009 


0.015 


0.930 


0.01 1 


0.012 


0.91 




KWT 


0.029 


0.685 


0.072 


0.158 


0.024 


0.016 


0.016 


0.36 


0.79 


MVP 


0.007 


0.014 


0.012 


0.019 


0.924 


0.018 


0.006 


0.90 




SAM 


0.004 


0.648 


0.269 


0.058 


0.009 


0.009 


0.003 


0.32 


0.76 


CBD 


0.01 7 


0.034 


0.021 


0.022 


0.878 


0.009 


0.018 


0.85 




DRU-1 


0.009 


0.531 


0.353 


0.087 


0.006 


0.005 


0.009 


0.25 


0.53 


YAK 


0.008 


0.01 7 


0.022 


0.021 


0.895 


0.031 


0.007 


0.67 




Europe (n = 


= 853) 


















MLY 


0.01 7 


0.024 


0.041 


0.057 


0.765 


0.014 


0.082 


0.40 


0.90 


FIN 


0.005 


0.036 


0.870 


0.053 


0.014 


0.015 


0.007 


0.66 


0.97 


Americas (n 


= 298) 


















IRI 


0.005 


0.099 


0.841 


0.036 


0.006 


0.006 


0.007 


0.72 


0.92 


KAR 


0.002 


0.001 


0.001 


0.002 


0.003 


0.988 


0.002 


1 




EAM 


0.005 


0.112 


0.818 


0.046 


0.008 


0.007 


0.005 


0.68 


0.89 


SUR 


0.001 


0.002 


0.002 


0.002 


0.003 


0.988 


0.002 


1 




DAN 


0.007 


0.116 


0.811 


0.045 


0.006 


0.011 


0.005 


0.69 


0.90 


PMM 


0.005 


0.007 


0.007 


0.011 


0.021 


0.943 


0.006 


0.95 




CEU 


0.007 


0.111 


0.806 


0.055 


0.009 


0.007 


0.005 


0.63 


0.90 


COL-1 


0.003 


0.007 


0.008 


0.017 


0.011 


0.947 


0.006 


0.92 




RUA 


0.005 


0.076 


0.798 


0.078 


0.016 


0.017 


0.009 


0.50 


0.93 


TIC 


0.008 


0.015 


0.009 


0.033 


0.020 


0.904 


0.010 


0.73 




HGR 


0.005 


0.154 


0.766 


0.051 


0.010 


0.007 


0.006 


0.51 


0.84 


QUE 


0.021 


0.035 


0.040 


0.026 


0.034 


0.836 


0.009 


0.55 




RUV 


0.005 


0.130 


0.689 


0.136 


0.013 


0.016 


0.011 


0.44 


0.76 


MAY 


0.029 


0.041 


0.027 


0.044 


0.028 


0.815 


0.016 


0.52 


0.97 
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Table 3 Continental ancestry based on STRUCTURE analysis of 68 test populations genotyped on the 41 -AIM panel with HGDP subjects included as reference 
populations (Continued) 



TSI 


0.006 


0.271 


0.666 


0.044 


0.005 


0.005 


0.003 


0.30 


0.71 


MUS 


0.010 


0.094 


0.304 


0.075 


0.031 


0.470 


0.015 


0.10 


0.50 


KMZ 


0.005 


0.064 


0.661 


0.172 


0.037 


0.038 


0.023 


0.18 


0.80 


MEX 


0.024 


0.179 


0.248 


0.164 


0.032 


0.344 


0.009 


0 


0.26 


ASH 


0.007 


0.329 


0.549 


0.095 


0.010 


0.006 


0.005 


0.22 


0.58 


Oceania (n 


= 69) 


















ADY 


0.005 


0.184 


0.540 


0.241 


0.016 


0.010 


0.005 


0.22 


0.58 


PNG-1 


0.006 


0.003 


0.003 


0.003 


0.009 


0.004 


0.972 


1 


1 


SRD-1 


0.006 


0.398 


0.538 


0.038 


0.009 


0.006 


0.005 


0.30 


0.58 


NAS 


0.006 


0.004 


0.004 


0.005 


0.015 


0.006 


0.960 


0.92 


1 


CHV 


0.005 


0.110 


0.535 


0.224 


0.084 


0.023 


0.019 


0.22 


0.51 


MCR 


0.020 


0.014 


0.008 


0.019 


0.680 


0.010 


0.249 


0 


0.14 


RMJ 


0.005 


0.490 


0.453 


0.039 


0.007 


0.003 


0.003 


0.13 


0.46 


SMO 


0.018 


0.031 


0.034 


0.056 


0.694 


0.016 


0.151 


0 


0 



*85% MS (50% MS): percent of subjects with at least 85% (or 50%, respectively) membership in the geographically pre-assigned continental cluster. 
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additional 16.8% variability and added a European com- 
ponent (Panel A). Of note is the misleading positioning 
of the admixed HapMap Mexicans (MEX) and Native 
American Muscogee (MUS), both falling within the East 
Asian cluster. Adding PCS, which accounted for another 
6.2% of the genetic variability and includes the Native 
American component, resolved the structure and cor- 
rectly placed the MEX and MUS between Europe and 
the Americas (Panel B). 

We performed an analysis of the eigenvalues of the first 
15 PCs and found that over 56% of the genetic variation 
among the seven continental regions was accounted for by 
the first five PCs (Figure 3). 
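
The share of variability attributed to each PC follows directly from the eigenvalues: each eigenvalue is divided by the sum of all eigenvalues, and the shares are accumulated. The sketch below illustrates this arithmetic only. The eigenvalues are hypothetical and are not the values behind Figure 3, and the function name `explained_fractions` is an assumption. For a real analysis, the list would have to contain the eigenvalues of all PCs, not only the leading ones.

```python
from itertools import accumulate
from typing import List, Sequence


def explained_fractions(eigenvalues: Sequence[float]) -> List[float]:
    """Share of total variance carried by each PC (eigenvalue / sum of eigenvalues)."""
    total = sum(eigenvalues)
    if total <= 0:
        raise ValueError("eigenvalues must sum to a positive number")
    return [ev / total for ev in eigenvalues]


# Hypothetical eigenvalues (NOT the study's values), sorted in decreasing order.
eigenvalues = [5.0, 3.0, 1.0, 0.6, 0.4]

fractions = explained_fractions(eigenvalues)
cumulative = list(accumulate(fractions))

for k, (frac, cum) in enumerate(zip(fractions, cumulative), start=1):
    print(f"PC{k}: {frac:.1%} of variance, {cum:.1%} cumulative")
```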

Applications of the 41-AIM panel

To highlight a practical application of the 41-AIM panel
and our large collection of reference populations, we 
considered the case of a genetic association study with 
subjects collected in Southern California. In order to 
minimize spurious results due to population stratifica- 
tion (i.e., false-positive associations between a phenotype 
and genetic marker), a PCA is often applied. PCs can be 
used as an easy tool to visualize large amounts of data 
or can be included as covariates in association analyses 
to adjust for population stratification. 

PC plots of the first three PCs generated based on
genotype data of the 41 AIMs are shown in Figure 4 for
the complete reference set of 4,018 subjects from 120
populations. When placed in the context of clusters,
several populations appear as admixed among the eight
colored continental regions (see Table 3; Siberia has been
added as its own region here) or are truly admixed (such
as the African Americans, Mexicans, Mozabites and
others). To increase resolution, we removed these
populations (n = 13, open symbols) in specific applications.

Figure 5 shows PC plots of Southern Californian test
subjects and 107 'typical' reference populations. The first
five PCs (PC1 - PC5) explain a total of 51.5% variability
in the data, and each identifies different aspects of the
population distribution. PC2 highlighted the
European-African gradient and identified African Americans
(panel A), PC3 added the Native American component and
separated Mexican Americans from Central/South Asians
(panel B), and PC4 separated Oceania (panel C).
Corresponding to the small eigenvalue of the fifth PC (see
Figure 3), PC5 explained only a small fraction of the
genetic variability (2.1%) in this setting and did not lead
to a strong separation of the eight geographic clusters
(panel D). However, PC5 was found to show a north-south
cline in Eurasian populations, as indicated by
a significant correlation of PC5 values with the average
latitude of 77 Eurasian populations (Spearman's
ρ = 0.62, p < 0.001). An often-applied alternative
to the PCA is the multidimensional scaling (MDS)
approach implemented in the genetic association software
PLINK. MDS analyses lead to essentially the same results
(Additional file 3: Figure S1).

Figure 2 Principal component analysis (PCA) based on genotype data of 41 AIMs including 4,018 subjects from 120 populations from
the HGDP, HapMap and Yale collections. Individual values of subjects belonging to the same population are averaged to highlight the relative
location of specific populations.

Figure 3 Eigenvalues of the first 15 principal components (PCs)
indicating that most genetic variation among the seven
continental regions captured by the 41-AIM panel is accounted
for by the first 5 PCs.

Guided by these visual approaches, subjects are then
typically grouped into a small number of more homogeneous
groups (e.g., European Americans or African
Americans) prior to association analysis, using clustering
methods such as those implemented in STRUCTURE.
Additional population stratification and varying degrees of
individual admixture are then accounted for within these
more homogeneous groups.

To assess the informativeness of the AIM panels in
detecting admixture at the individual level, we compared
STRUCTURE admixture estimates based on the 41 and
31 AIMs with independently derived estimates based on
a large, GWAS-derived panel (see Methods). Subjects
were selected based on self-report from the admixed
African American and Hispanic White and Native
American populations (Figure 6). Individual ancestry
proportions derived from the 41-AIM and GWAS panels
were strongly correlated for both the Hispanic White
and Native American populations (n = 484, Pearson's
r = 0.81 for the proportion of Native American ancestry
in panel A, r = 0.81 for the proportion of European
ancestry in panel C) and the African Americans (n = 106,
Pearson's r = 0.86 for the proportion of African ancestry
in panel B, r = 0.85 for the proportion of European
ancestry in panel D; all p < 2 x 10'^^). Slightly reduced
correlations of individual admixture estimates were
achieved between the GWAS-derived and smaller
31-AIM panels (A: r = 0.77, B: r = 0.78, C: r = 0.84 and D:
r = 0.85, respectively). Importantly, a detailed inspection
of the scatter plots indicates that the AIM panels lack
sensitivity in detecting admixture in individuals with low
proportions of admixture (in the range of < 20-25%)
when compared to the GWAS-derived panel.
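
The agreement between two sets of individual admixture estimates is summarized above by Pearson's r. A minimal sketch of that calculation follows. The two lists of ancestry proportions are hypothetical and only stand in for AIM-panel and GWAS-derived estimates for the same subjects. The function name `pearson_r` is an assumption, and the study's own statistical software is not specified in this section.

```python
import math
from typing import Sequence


def pearson_r(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long samples."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two samples of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    s_xy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    s_xx = sum((a - mean_x) ** 2 for a in x)
    s_yy = sum((b - mean_y) ** 2 for b in y)
    if s_xx == 0 or s_yy == 0:
        raise ValueError("correlation is undefined for a constant sample")
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical ancestry proportions for five subjects (not study data).
aim_panel_estimates = [0.05, 0.20, 0.35, 0.50, 0.80]
gwas_panel_estimates = [0.10, 0.18, 0.40, 0.55, 0.75]

r = pearson_r(aim_panel_estimates, gwas_panel_estimates)
print(f"Pearson's r = {r:.2f}")
```

A high r across the whole cohort does not rule out poor agreement at the low end of the admixture range, which is why the scatter plots are inspected separately.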

Discussion 

Application and limitations 

Our motivation was to develop a feasible method to
discern continental ancestry that would safeguard
against the impact of population stratification in small
genetic association studies where limited resources
preclude large genotyping efforts. We achieved this by
choosing a very small set of highly discriminative AIMs
suitable for multiplexing applications, thus enabling
lower cost and higher throughput. To ensure a wide
application potential, we optimized our panel for two
commonly used multiplex platforms, the ABI SNPlex [25]
and Sequenom iPLEX systems [26]. At the same time,
these AIMs perform well in single-SNP TaqMan assays
and can also be extracted from whole-genome arrays
such as the Illumina HumanHap550 chip, thus allowing
an easy combination of samples with genotyping from
different sources. This is especially important for AIMs,
where imputation of SNPs based on information from
genotyped markers is not advisable.

Our panel was able to accurately discern the global
ancestry of a large majority of subjects originating from
one of the seven specific ancestral clusters. This was the
case for both the full 41-AIM panel and the subset of 31
AIMs, indicating that a balanced reduction of markers
in these small panels did not significantly impact the
robustness of the results. A direct comparison of our
findings with a previously published small panel of 93 AIMs
from Seldin's group [10] showed 89.7% agreement
in continental assignment of HGDP subjects (data
not shown), further validating our panel.

Not surprisingly, the biggest limitation to inferring
global ancestry was found for subjects from Eurasia,
where low Fst values of 0.06 - 0.09 among Europe, the
Middle East and Central/South Asia indicated little
genetic differentiation. The clinal distributions of allele
frequencies between Europe and East Asia pose a challenge
for the identification of highly discriminative markers, a
limitation also impacting other small AIM panels
[12,46]. We therefore suggest supplementing our panel
with additional high-resolution markers for studies with
focus on Eurasia. Such markers suitable for discerning
specific pairs of regions can easily be extracted from our
extensive preselected list of global AIMs.
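
To make the meaning of low Fst values concrete, the following sketch computes a simple heterozygosity-based Fst for two populations from per-SNP allele frequencies, using Fst = (H_T - H_S) / H_T with numerator and denominator summed over SNPs. This is an illustrative estimator under the assumption of equal population weights and biallelic SNPs. The estimator actually used in the study is not stated in this section, and the allele frequencies and the function name `fst_two_populations` are hypothetical.

```python
from typing import Sequence


def fst_two_populations(freqs_a: Sequence[float], freqs_b: Sequence[float]) -> float:
    """Heterozygosity-based Fst for two equally weighted populations.

    freqs_a / freqs_b hold the frequency of the same reference allele at each
    biallelic SNP. Numerator and denominator are summed over SNPs before the
    ratio is taken.
    """
    if len(freqs_a) != len(freqs_b) or not freqs_a:
        raise ValueError("need two non-empty frequency lists of equal length")
    numerator = 0.0
    denominator = 0.0
    for p_a, p_b in zip(freqs_a, freqs_b):
        p_bar = (p_a + p_b) / 2
        h_total = 2 * p_bar * (1 - p_bar)                     # pooled heterozygosity
        h_within = (2 * p_a * (1 - p_a) + 2 * p_b * (1 - p_b)) / 2  # mean within-population
        numerator += h_total - h_within
        denominator += h_total
    if denominator == 0:
        raise ValueError("all SNPs are monomorphic in the pooled sample")
    return numerator / denominator


# Hypothetical allele frequencies at three SNPs (not study data).
similar_pair = fst_two_populations([0.40, 0.55, 0.30], [0.50, 0.60, 0.35])
distinct_pair = fst_two_populations([0.10, 0.90, 0.20], [0.85, 0.15, 0.80])
print(f"similar pair: {similar_pair:.3f}, distinct pair: {distinct_pair:.3f}")
```

Markers whose frequencies differ only slightly between two regions contribute little to the numerator, which is the situation described for Europe, the Middle East and Central/South Asia.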

Impact of reference populations 

Independent of the statistical method used to determine
ancestry and admixture proportions, the results of these
analyses depend not only on the informativeness of the
genetic markers, but also strongly on the set of reference
populations included. An omission of reference subjects
from an ancestral group likely leads to misclassification
of test subjects with similar ancestries. For example,
we previously found that African Americans clustered
strongly with Central Asians in a three-way admixture
analysis (erroneously) including only reference subjects
from Europe, Central Asia and the Americas.

explained ranges from 27.6% for PC1 to 6.2% for PC3.

We considered this crucial issue during panel develop- 
ment and leveraged the publicly available 52 HGDP pop- 
ulations collected across the globe [47]. We then 
increased our reference set by leveraging AIM frequen- 
cies from the HapMap III and performing additional 
AIM genotyping in our large global collection [33]. With 
over 4,000 subjects from 120 global populations, we thus 
assembled one of the largest reference sets published for 
the purpose of ancestry determination. However, specific 
regions such as Siberia are still underrepresented, and 
efforts to expand our reference subject collection are 
ongoing. 

Admixed subjects 

Whereas ancestry assignment of subjects from a specific
geographic area represented by a cluster in the reference
population set is a quantifiable and relatively
straightforward task, admixed subjects resulting from ancient or
recent contact of populations with distinct ancestries
pose challenges.

If such a cohort consists of admixed subjects with 
known ancestry contributions, such as two-way admixed 
Mexican Americans collected from a distinct area in 
Southern California, the varying degree of European and 
Native American ancestries can easily be estimated in 
admixture analyses implemented in statistical packages
such as STRUCTURE, ADMIXTURE [43] or BAPS [48]. 
Our AIM panels were able to detect admixture in indi- 
viduals from these populations, but as expected for such 
small panels, were less sensitive when individual admix- 
ture proportions were low. 

However, in a clinical sample including subjects of
unknown ancestral origin and complex population structure,
as is often the case in our studies (e.g., [35,49,50]), the
presented methods may lack specificity to distinguish
between admixture of distinct populations and erroneously
place admixed subjects together with intermediate
populations. This is especially true when including only the first
few components of multivariate data reduction methods
such as MDS and PC analyses (see, e.g., Figure 2, Panel A,
for Mexican Americans clustering with Central/South
Asians). In these cases, adding demographic information
such as self-declared race and ethnicity is
strongly suggested to help minimize misassignments.

Challenges for genetic association studies

Adding to the complexities of accurately differentiating
ancestral groups and estimating admixture proportions
is the appropriate incorporation of this information into
the design of genetic association studies. While the
negative impact of population structure on association
studies is well known [6], and methods to control for it
are established and now routinely applied to studies of
relatively homogeneous cohorts such as typically
collected for GWAS (see, e.g., a recent review [5]), the
situation remains challenging for heterogeneous clinical
collections or epidemiological cohorts.

Depending on the composition and relative numbers
of subjects from different ancestral backgrounds,
common questions in such studies include the genetic
definition of African Americans, who typically show degrees
of European admixture that vary among individuals [51].
There is currently no consensus for an appropriate
cutoff point between European Americans and African
Americans. Even less trivial is the incorporation into
association studies of three-way admixed subjects such
as Caribbean Latinos originating from Puerto Rico and
the Dominican Republic [52], typically showing both
Native American and high levels of African ancestry.

For practical purposes, we often employ a multi-tier
approach: we first group subjects into continental
clusters using a majority criterion with statistical methods
such as STRUCTURE and then confirm the plausibility
of the grouping with demographic data, where available.
Next, we aim to place most subjects into a very small
number of clusters including genetically similar subjects,
for example, by combining similar continental groups
such as from Eurasia, and excluding outliers of minority
ancestries. Lastly, we control for additional population
stratification within clusters by incorporating MDS
components into association studies and ultimately combine
results in meta-analyses, where appropriate. Such a
method was, for example, employed for the Southern
Californian population sample presented here, which
encompassed a wide array of self-declared ethnic groups.
Our approach resulted in a four-cluster analysis with
61% European Americans, 18% subjects with Native
American admixture, 7% subjects with African
admixture, and 15% subjects of other ancestry and/or complex
admixtures.
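
The first two tiers of this approach can be sketched as follows, assuming per-subject membership vectors from a STRUCTURE-style analysis. The 0.5 majority threshold, the particular merge of regions into "Eurasia", the label "other/complex" and all example subjects are hypothetical choices made for illustration. The later tiers (MDS covariates and meta-analysis) are not covered.

```python
from collections import Counter
from typing import Dict, List, Sequence

# Assumed column order of the membership vectors (hypothetical convention).
REGIONS: List[str] = [
    "Africa", "Middle East", "Europe", "CS Asia", "E Asia", "Americas", "Oceania",
]

# Hypothetical second-tier merge of genetically similar continental groups.
MERGE: Dict[str, str] = {
    "Middle East": "Eurasia",
    "Europe": "Eurasia",
    "CS Asia": "Eurasia",
}


def assign_cluster(q: Sequence[float], majority: float = 0.5) -> str:
    """Tier 1: majority criterion per region. Tier 2: merge similar regions."""
    if len(q) != len(REGIONS):
        raise ValueError("membership vector has the wrong length")
    best = max(range(len(REGIONS)), key=lambda i: q[i])
    if q[best] <= majority:
        return "other/complex"  # no single region holds a majority
    region = REGIONS[best]
    return MERGE.get(region, region)


# Hypothetical cohort (each row sums to 1; not study data).
cohort = [
    [0.01, 0.10, 0.80, 0.05, 0.02, 0.01, 0.01],
    [0.90, 0.03, 0.03, 0.01, 0.01, 0.01, 0.01],
    [0.40, 0.05, 0.40, 0.05, 0.05, 0.03, 0.02],
    [0.02, 0.05, 0.25, 0.05, 0.03, 0.59, 0.01],
]

counts = Counter(assign_cluster(q) for q in cohort)
for cluster, n in sorted(counts.items()):
    print(f"{cluster}: {n}")
```

The grouping produced this way would still need to be checked against demographic data before it is used in an association analysis.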

Conclusion 

In conclusion, we demonstrated the utility and limita- 
tions of a small AIM panel specifically developed to dis- 
cern global ancestry. We believe that it will find wide 
application because of its feasibility and potential for a 
wide range of applications. To allow this reference set to 
be readily accessible for others to use, we are entering 
the allele frequencies for these 41 SNPs into ALFRED 
(alfred.med.yale.edu) [34] as an "SNP Set." To allow 
ready estimation of likelihoods of ancestry of individuals, 
these SNPs are also being entered as an additional 
AISNP Panel in FROG-kb (frog.med.yale.edu) [53]. 

Additional files 



Additional file 1: Table S1. Geographic sampling location, population
name, number of subjects and source of genotype data of 120 reference
populations.

Additional file 2: Table S2. Chromosomal position (GRCh37.p5), alleles
and informativeness (I_n) of 1,442 continental AIMs and sequence
information for the multiplex 41-AIM and 31-AIM panels.

Additional file 3: Figure S1. MDS plots of the first five MDS
components for a visual inspection of a large population sample
collected in Southern California (black). Subjects from 107 typical
reference populations are color coded.